{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5e6c51b1",
   "metadata": {},
   "source": [
    "# Lesson 6: Automatic Speech Recognition"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac90500c",
   "metadata": {},
   "source": [
    "- In the classroom, the libraries are already installed for you.\n",
    "- If you would like to run this code on your own machine, you can install the following:\n",
    "``` \n",
    "    !pip install transformers\n",
    "    !pip install -U datasets\n",
    "    !pip install soundfile\n",
    "    !pip install librosa\n",
    "    !pip install gradio\n",
    "```\n",
    "\n",
    "The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. \n",
    "- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d654d27-1e91-416c-be34-37e830800585",
   "metadata": {},
   "source": [
    "- Here is some code that suppresses warning messages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f3f4054-9a14-4f41-820f-e07197890390",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "from transformers.utils import logging\n",
    "logging.set_verbosity_error()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61707b14",
   "metadata": {},
   "source": [
    "### Data preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e11e596",
   "metadata": {
    "height": 30,
    "tags": []
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b0a5065-8210-435d-a04a-1bda7e8bf596",
   "metadata": {
    "height": 81
   },
   "outputs": [],
   "source": [
    "dataset = load_dataset(\"librispeech_asr\",\n",
    "                       split=\"train.clean.100\",\n",
    "                       streaming=True,\n",
    "                       trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc38a0be-db12-46f3-ba59-6a6f990423c4",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "example = next(iter(dataset))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c54e08d6-4b68-4d87-8a8c-f28b5981a854",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "dataset_head = dataset.take(5)\n",
    "list(dataset_head)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba094634-00a2-469b-8862-ee7d68e72d86",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "list(dataset_head)[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2932356-2ea0-4797-a1eb-d0cd30cacedc",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ff63caa-aa7b-4da2-9317-4ec30999c0af",
   "metadata": {
    "height": 81
   },
   "outputs": [],
   "source": [
    "from IPython.display import Audio as IPythonAudio\n",
    "\n",
    "IPythonAudio(example[\"audio\"][\"array\"],\n",
    "             rate=example[\"audio\"][\"sampling_rate\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fda63f6",
   "metadata": {},
   "source": [
    "### Build the pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb518e06",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "from transformers import pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15499e9e-d3c1-4010-997a-2be716c241c7",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "asr = pipeline(task=\"automatic-speech-recognition\",\n",
    "               model=\"./models/distil-whisper/distil-small.en\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2755a2f",
   "metadata": {},
   "source": [
    "Info about [distil-whisper/distil-small.en](https://huggingface.co/distil-whisper)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea81f743-6e8b-4021-b9b1-1d1873b88fd0",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "asr.feature_extractor.sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18ae7edb-d852-4fd7-adcf-c0274b494a4b",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "example['audio']['sampling_rate']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "786cca9b-1fe0-404c-baf3-26e03a6d1ab1",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "asr(example[\"audio\"][\"array\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e5a0ef-81b8-4e4d-8a15-cd063860a9de",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "example[\"text\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac81083e",
   "metadata": {},
   "source": [
    "### Build a shareable app with Gradio\n",
    "\n",
    "### Troubleshooting Tip\n",
    "- Note, in the classroom, you may see the code for creating the Gradio app run indefinitely.\n",
    "  - This is specific to this classroom environment when it's serving many learners at once, and you won't wouldn't experience this issue if you run this code on your own machine.\n",
    "- To fix this, please restart the kernel (Menu Kernel->Restart Kernel) and re-run the code in the lab from the beginning of the lesson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0cfe9264",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import gradio as gr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d17e293-4ade-4f9a-891b-177abd659206",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "demo = gr.Blocks()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adff9b2c-563a-45c2-8ed2-d65a1c281ffa",
   "metadata": {
    "height": 115
   },
   "outputs": [],
   "source": [
    "def transcribe_speech(filepath):\n",
    "    if filepath is None:\n",
    "        gr.Warning(\"No audio found, please retry.\")\n",
    "        return \"\"\n",
    "    output = asr(filepath)\n",
    "    return output[\"text\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc6df902-8d7b-47c6-ac23-745d4ff1f942",
   "metadata": {
    "height": 132
   },
   "outputs": [],
   "source": [
    "mic_transcribe = gr.Interface(\n",
    "    fn=transcribe_speech,\n",
    "    inputs=gr.Audio(sources=\"microphone\",\n",
    "                    type=\"filepath\"),\n",
    "    outputs=gr.Textbox(label=\"Transcription\",\n",
    "                       lines=3),\n",
    "    allow_flagging=\"never\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cf8c807",
   "metadata": {},
   "source": [
    "To learn more about building apps with Gradio, you can check out the short course: [Building Generative AI Applications with Gradio](https://www.deeplearning.ai/short-courses/building-generative-ai-applications-with-gradio/), also taught by Hugging Face."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11982cc9-e13d-4ef8-87c6-de529ea9d45f",
   "metadata": {
    "height": 149
   },
   "outputs": [],
   "source": [
    "file_transcribe = gr.Interface(\n",
    "    fn=transcribe_speech,\n",
    "    inputs=gr.Audio(sources=\"upload\",\n",
    "                    type=\"filepath\"),\n",
    "    outputs=gr.Textbox(label=\"Transcription\",\n",
    "                       lines=3),\n",
    "    allow_flagging=\"never\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad06a929-8f7b-4c1d-a224-4fbcd807b55a",
   "metadata": {
    "height": 183
   },
   "outputs": [],
   "source": [
    "with demo:\n",
    "    gr.TabbedInterface(\n",
    "        [mic_transcribe,\n",
    "         file_transcribe],\n",
    "        [\"Transcribe Microphone\",\n",
    "         \"Transcribe Audio File\"],\n",
    "    )\n",
    "\n",
    "demo.launch(share=True, \n",
    "            server_port=int(os.environ['PORT1']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4764e5a0-c8d0-4801-8ee9-a26e0b40135f",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "demo.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b4befa9",
   "metadata": {},
   "source": [
    "## Note: Please stop the demo before continuing with the rest of the lab.\n",
    "- The app will continue running unless you run\n",
    "  ```Python\n",
    "  demo.close()\n",
    "  ```\n",
    "- If you run another gradio app (later in this lesson) without first closing this appp, you'll see an error message:\n",
    "  ```Python\n",
    "  OSError: Cannot find empty port in range\n",
    "  ```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ca98d74",
   "metadata": {},
   "source": [
    "* Testing with a longer audio file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a26fe227-f981-4399-8ff3-ce99647168cf",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "import soundfile as sf\n",
    "import io"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5adbc33f",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio, sampling_rate = sf.read('narration_example.wav')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cad2ecc",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82821e61",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "asr.feature_extractor.sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d6083dd-d874-4e9b-9a7d-26a2f1d154f1",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "asr(audio)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cbdc7bd",
   "metadata": {},
   "source": [
    "_Note:_ Running the cell above will return:\n",
    "```\n",
    "ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62c10b2f",
   "metadata": {},
   "source": [
    "* Convert the audio from stereo to mono (Using librosa)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1751973f-aeee-497a-8c21-c51d971c7fc4",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce73debe-ebdb-403d-bd23-608f5823a496",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "audio_transposed = np.transpose(audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "622f2cea-255a-4b7b-b322-688ada953c3b",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_transposed.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9d63aee-d3f4-48d6-b4d3-c0249c4e8732",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "import librosa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae60766f-880c-41fd-8ace-8d79d4bf3747",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_mono = librosa.to_mono(audio_transposed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed7c0cb4-64d9-4d55-814e-6b0893e5001c",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "IPythonAudio(audio_mono,\n",
    "             rate=sampling_rate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d73d951-9955-497a-b6b5-a908cde5e4f5",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "asr(audio_mono)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25ccb846",
   "metadata": {},
   "source": [
    "_Warning:_ The cell above might throw a warning because the sample rate of the audio sample is not the same of the sample rate of the model.\n",
    "\n",
    "Let's check and fix this!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2edf91f6-48b3-405b-8b3c-1aa9d07120db",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c7cb71f-52f0-4bfa-9a63-c99793310755",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "asr.feature_extractor.sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2dce1909-96b9-41bc-9e3f-da14c9568401",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "audio_16KHz = librosa.resample(audio_mono,\n",
    "                               orig_sr=sampling_rate,\n",
    "                               target_sr=16000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40415a12-e2c6-4bb3-8bc1-8108c57ac7b3",
   "metadata": {
    "height": 115
   },
   "outputs": [],
   "source": [
    "asr(\n",
    "    audio_16KHz,\n",
    "    chunk_length_s=30, # 30 seconds\n",
    "    batch_size=4,\n",
    "    return_timestamps=True,\n",
    ")[\"chunks\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "713cb23a",
   "metadata": {},
   "source": [
    "* Build the Gradio interface."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e891c095-e8f3-4917-8d91-4f8c5f94fcfb",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "demo = gr.Blocks()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3f27d78-de74-45a1-96e8-ebd9b0d367f4",
   "metadata": {
    "height": 200
   },
   "outputs": [],
   "source": [
    "def transcribe_long_form(filepath):\n",
    "    if filepath is None:\n",
    "        gr.Warning(\"No audio found, please retry.\")\n",
    "        return \"\"\n",
    "    output = asr(\n",
    "      filepath,\n",
    "      max_new_tokens=256,\n",
    "      chunk_length_s=30,\n",
    "      batch_size=8,\n",
    "    )\n",
    "    return output[\"text\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eec7801-aab9-4014-85f4-5b8c01ab4edd",
   "metadata": {
    "height": 285
   },
   "outputs": [],
   "source": [
    "mic_transcribe = gr.Interface(\n",
    "    fn=transcribe_long_form,\n",
    "    inputs=gr.Audio(sources=\"microphone\",\n",
    "                    type=\"filepath\"),\n",
    "    outputs=gr.Textbox(label=\"Transcription\",\n",
    "                       lines=3),\n",
    "    allow_flagging=\"never\")\n",
    "\n",
    "file_transcribe = gr.Interface(\n",
    "    fn=transcribe_long_form,\n",
    "    inputs=gr.Audio(sources=\"upload\",\n",
    "                    type=\"filepath\"),\n",
    "    outputs=gr.Textbox(label=\"Transcription\",\n",
    "                       lines=3),\n",
    "    allow_flagging=\"never\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d97e0b2d-8745-406c-b9c3-32b8e76f9512",
   "metadata": {
    "height": 166
   },
   "outputs": [],
   "source": [
    "with demo:\n",
    "    gr.TabbedInterface(\n",
    "        [mic_transcribe,\n",
    "         file_transcribe],\n",
    "        [\"Transcribe Microphone\",\n",
    "         \"Transcribe Audio File\"],\n",
    "    )\n",
    "demo.launch(share=True, \n",
    "            server_port=int(os.environ['PORT1']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0484ebb",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "demo.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d01b5aa7",
   "metadata": {
    "height": 166
   },
   "source": [
    "## Note: Please stop the demo before continuing with the rest of the lab.\n",
    "- The app will continue running unless you run\n",
    "  ```Python\n",
    "  demo.close()\n",
    "  ```\n",
    "- If you run another gradio app (later in this lesson) without first closing this appp, you'll see an error message:\n",
    "  ```Python\n",
    "  OSError: Cannot find empty port in range\n",
    "  ```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbfaa6fb",
   "metadata": {},
   "source": [
    "### Try it yourself!\n",
    "- Try this model with your own audio files!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e20cb142-ec22-4c55-adc7-17d81ded715f",
   "metadata": {
    "height": 81,
    "tags": []
   },
   "outputs": [],
   "source": [
    "import soundfile as sf\n",
    "import io\n",
    "\n",
    "audio, sampling_rate = sf.read('narration_example.wav')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f9279ef-654b-4f8a-9be0-15cbef926b90",
   "metadata": {
    "height": 30,
    "tags": []
   },
   "outputs": [],
   "source": [
    "sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4341c526-c5aa-4ccf-a426-98ed5efeaeac",
   "metadata": {
    "height": 30,
    "tags": []
   },
   "outputs": [],
   "source": [
    "asr.feature_extractor.sampling_rate"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
